AITopics | mechanistic interpretability

mechanistic interpretability

This startup's new mechanistic interpretability tool lets you debug LLMs

MIT Technology Review | Apr-30-2026, 15:59:41 GMT

Goodfire wants to make training AI models more like good old-fashioned software engineering. The San Francisco-based startup Goodfire just released a new tool, called Silico, that lets researchers and engineers peer inside an AI model and adjust its parameters (the settings that determine a model's behavior) during training. This could give model makers more fine-grained control over how this technology is built than was once thought possible. Goodfire claims Silico is the first off-the-shelf tool of its kind that can help developers debug all stages of the development process, from building a data set to training a model. The company says its mission is to make building AI models less like alchemy and more like a science.

large language model, machine learning, natural language, (17 more...)

Country: North America > United States > California > San Francisco County > San Francisco (0.25)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.50)

b4aadf04d6fde46346db455402860708-Paper-Conference.pdf

Neural Information Processing Systems | Apr-29-2026, 12:24:53 GMT

artificial intelligence, interpretability, machine learning, (18 more...)

Country:

North America > United States (0.46)
Europe > Germany (0.28)

Genre: Research Report > New Finding (0.93)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

There Will Be a Scientific Theory of Deep Learning

Simon, Jamie; Kunin, Daniel; Atanasov, Alexander; Boix-Adserà, Enric; Bordelon, Blake; Cohen, Jeremy; Ghosh, Nikhil; Guth, Florentin; Jacot, Arthur; Kamb, Mason; Karkada, Dhruva; Michaud, Eric J.; Ottlik, Berkan; Turnbull, Joseph

arXiv.org Machine Learning | Apr-24-2026

In this paper, we make the case that a scientific theory of deep learning is emerging. By this we mean a theory which characterizes important properties and statistics of the training process, hidden representations, final weights, and performance of neural networks. We pull together major strands of ongoing research in deep learning theory and identify five growing bodies of work that point toward such a theory: (a) solvable idealized settings that provide intuition for learning dynamics in realistic systems; (b) tractable limits that reveal insights into fundamental learning phenomena; (c) simple mathematical laws that capture important macroscopic observables; (d) theories of hyperparameters that disentangle them from the rest of the training process, leaving simpler systems behind; and (e) universal behaviors shared across systems and settings which clarify which phenomena call for explanation. Taken together, these bodies of work share certain broad traits: they are concerned with the dynamics of the training process; they primarily seek to describe coarse aggregate statistics; and they emphasize falsifiable quantitative predictions. We argue that the emerging theory is best thought of as a mechanics of the learning process, and suggest the name learning mechanics. We discuss the relationship between this mechanics perspective and other approaches for building a theory of deep learning, including the statistical and information-theoretic perspectives. In particular, we anticipate a symbiotic relationship between learning mechanics and mechanistic interpretability. We also review and address common arguments that fundamental theory will not be possible or is not important. We conclude with a portrait of important open directions in learning mechanics and advice for beginners. We host further introductory materials, perspectives, and open questions at learningmechanics.pub.

artificial intelligence, deep learning, machine learning, (17 more...)

arXiv:2604.21691

Country:

Africa > Middle East > Tunisia > Ben Arous Governorate > Ben Arous (0.04)
North America > United States (0.04)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.75)

Identifying interactions at scale for LLMs

AIHub | Apr-21-2026, 13:37:46 GMT

Understanding the behavior of complex machine learning systems, particularly Large Language Models (LLMs), is a critical challenge in modern artificial intelligence. Interpretability research aims to make the decision-making process more transparent to model builders and impacted humans, a step toward safer and more trustworthy AI. To achieve state-of-the-art performance, models synthesize complex feature relationships, find shared patterns from diverse training examples, and process information through highly interconnected internal components. In this blog post, we describe the fundamental ideas behind SPEX and ProxySPEX, algorithms capable of identifying these critical interactions at scale. We mask or remove specific segments of the input prompt and measure the resulting shift in the predictions.
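The masking idea in the last sentence can be illustrated with a minimal sketch. This is not the SPEX or ProxySPEX algorithm: it is a brute-force enumeration over single positions and pairs, and it assumes a hypothetical `toy_model` scoring function standing in for an LLM. All function names here are illustrative.

```python
from itertools import combinations


def toy_model(tokens):
    """Hypothetical stand-in for an LLM: returns a sentiment-like score."""
    score = 0.0
    if "good" in tokens:
        score += 1.0
    if "not" in tokens and "good" in tokens:
        score -= 2.0  # negation flips the contribution of "good"
    return score


def masked_score(model, tokens, masked):
    """Score the prompt with the positions in `masked` removed."""
    kept = [t for i, t in enumerate(tokens) if i not in masked]
    return model(kept)


def single_effects(model, tokens):
    """Shift in the prediction when each position is removed on its own."""
    full = model(tokens)
    return {i: full - masked_score(model, tokens, {i})
            for i in range(len(tokens))}


def pairwise_interactions(model, tokens):
    """Effect of removing i and j together, beyond the sum of removing
    each one alone. A nonzero value signals an interaction."""
    full = model(tokens)
    out = {}
    for i, j in combinations(range(len(tokens)), 2):
        d_i = full - masked_score(model, tokens, {i})
        d_j = full - masked_score(model, tokens, {j})
        d_ij = full - masked_score(model, tokens, {i, j})
        out[(i, j)] = d_ij - d_i - d_j
    return out


prompt = ["the", "movie", "was", "not", "good"]
effects = single_effects(toy_model, prompt)
interactions = pairwise_interactions(toy_model, prompt)
```

In this toy setting only the pair ("not", "good") has a nonzero interaction. Enumerating all pairs already costs a number of model calls quadratic in the prompt length, and higher-order interactions grow combinatorially, which is presumably why dedicated algorithms are needed to do this at scale.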

large language model, machine learning, natural language, (18 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Compact Proofs of Model Performance via Mechanistic Interpretability

Neural Information Processing Systems | Mar-21-2026, 14:57:39 GMT

We propose using mechanistic interpretability (techniques for reverse engineering model weights into human-interpretable algorithms) to derive and compactly prove formal guarantees on model performance. We prototype this approach by formally proving accuracy lower bounds for a small transformer trained on Max-of-$K$, validating proof transferability across 151 random seeds and four values of $K$. We create 102 different computer-assisted proof strategies and assess their length and tightness of bound on each of our models. Using quantitative metrics, we find that shorter proofs seem to require and provide more mechanistic understanding. Moreover, we find that more faithful mechanistic understanding leads to tighter performance bounds. We confirm these connections by qualitatively examining a subset of our proofs. Finally, we identify compounding structureless errors as a key challenge for using mechanistic interpretability to generate compact proofs on model performance.

artificial intelligence, name change, proceedings, (3 more...)

Technology: Information Technology > Artificial Intelligence (0.41)

Scale Alone Does not Improve Mechanistic Interpretability in Vision Models

Neural Information Processing Systems | Feb-16-2026, 16:33:35 GMT

In light of the recent widespread adoption of AI systems, understanding the internal information processing of neural networks has become increasingly critical.

artificial intelligence, interpretability, machine learning, (16 more...)

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.05)
Oceania > New Zealand (0.04)
(5 more...)

Genre: Research Report > New Finding (0.93)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Meet the new biologists treating LLMs like aliens

MIT Technology Review | Jan-12-2026, 11:00:00 GMT

By studying large language models as if they were living things instead of computer programs, scientists are discovering some of their secrets for the first time. How large is a large language model? Think about it this way. In the center of San Francisco there's a hill called Twin Peaks from which you can view nearly the entire city. Picture all of it, every block and intersection, every neighborhood and park, as far as you can see, covered in sheets of paper. Now picture that paper filled with numbers. That's one way to visualize a large language model, or at least a medium-size one: Printed out in 14-point type, a 200-billion-parameter model, such as GPT-4o (released by OpenAI in 2024), could fill 46 square miles of paper, roughly enough to cover San Francisco.

language model, openai, reasoning model, (13 more...)

Country:

North America > United States > California > San Francisco County > San Francisco (0.44)
Pacific Ocean > North Pacific Ocean > San Francisco Bay > Golden Gate (0.04)
North America > United States > Massachusetts (0.04)
North America > United States > California > Los Angeles County > Los Angeles (0.04)

Industry:

Health & Medicine (0.47)
Media (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Scale Alone Does not Improve Mechanistic Interpretability in Vision Models

Neural Information Processing Systems | Dec-26-2025, 14:28:00 GMT

In light of the recent widespread adoption of AI systems, understanding the internal information processing of neural networks has become increasingly critical. Most recently, machine vision has seen remarkable progress by scaling neural networks to unprecedented levels in dataset and model size. We here ask whether this extraordinary increase in scale also positively impacts the field of mechanistic interpretability. In other words, has our understanding of the inner workings of scaled neural networks improved as well? We use a psychophysical paradigm to quantify one form of mechanistic interpretability for a diverse suite of nine models and find no scaling effect for interpretability, neither for model size nor for dataset size. Specifically, none of the investigated state-of-the-art models are easier to interpret than the GoogLeNet model from almost a decade ago.

mechanistic interpretability, name change, scale alone, (4 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.73)

Towards Automated Circuit Discovery for Mechanistic Interpretability

Neural Information Processing Systems | Dec-24-2025, 13:46:28 GMT

Through considerable effort and intuition, several recent works have reverse-engineered nontrivial behaviors of transformer models. This paper systematizes the mechanistic interpretability process they followed. First, researchers choose a metric and dataset that elicit the desired model behavior. Then, they apply activation patching to find which abstract neural network units are involved in the behavior. By varying the dataset, metric, and units under investigation, researchers can understand the functionality of each component. We automate one of the process' steps: finding the connections between the abstract neural network units that form a circuit. We propose several algorithms and reproduce previous interpretability results to validate them. For example, the ACDC algorithm rediscovered 5/5 of the component types in a circuit in GPT-2 Small that computes the Greater-Than operation. ACDC selected 68 of the 32,000 edges in GPT-2 Small, all of which were manually found by previous work.
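The two steps described above, activation patching followed by a search for the edges that form a circuit, can be sketched on a toy graph. This is a simplified illustration and not the published ACDC algorithm: the graph, its weights, the absolute-difference metric, and the threshold `tau` are all assumptions chosen for the example.

```python
# Hypothetical toy graph: each non-input node sums its weighted incoming edges.
EDGES = {
    ("x0", "h0"): 1.0,
    ("x1", "h0"): 0.0,  # carries no signal
    ("x1", "h1"): 1.0,
    ("h0", "out"): 1.0,
    ("h1", "out"): 0.0,  # carries no signal
}
ORDER = ["h0", "h1", "out"]  # topological order of non-input nodes


def run(inputs, corrupted_acts=None, patched=frozenset()):
    """Forward pass. Every edge in `patched` carries its source's activation
    from the corrupted run instead of the current run."""
    acts = dict(inputs)
    for node in ORDER:
        total = 0.0
        for (src, dst), weight in EDGES.items():
            if dst != node:
                continue
            if (src, dst) in patched:
                value = corrupted_acts[src]
            else:
                value = acts[src]
            total += weight * value
        acts[node] = total
    return acts


def discover_circuit(clean, corrupted, tau):
    """Greedily patch edges, working back from the output. An edge stays
    patched (is dropped from the circuit) if the output moves by less
    than `tau` relative to the unpatched clean run."""
    corrupted_acts = run(corrupted)
    baseline = run(clean)["out"]
    patched = set()
    by_depth = sorted(EDGES, key=lambda e: ORDER.index(e[1]), reverse=True)
    for edge in by_depth:
        trial = frozenset(patched | {edge})
        out = run(clean, corrupted_acts, trial)["out"]
        if abs(out - baseline) < tau:
            patched = set(trial)
    return set(EDGES) - patched


clean = {"x0": 1.0, "x1": 1.0}
corrupted = {"x0": 0.0, "x1": 0.0}
circuit = discover_circuit(clean, corrupted, tau=0.1)
```

On this toy graph the surviving edges are the path from `x0` through `h0` to `out`. The abstract does not say which metric or threshold the paper uses, so both choices above are only one possible design.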

automated circuit discovery, mechanistic interpretability, name change, (5 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.51)

Sparse Attention Post-Training for Mechanistic Interpretability

Draye, Florent; Lei, Anson; Posner, Ingmar; Schölkopf, Bernhard

arXiv.org Artificial Intelligence | Dec-8-2025

We introduce a simple post-training method that makes transformer attention sparse without sacrificing performance. Applying a flexible sparsity regularisation under a constrained-loss objective, we show on models up to 1B parameters that it is possible to retain the original pretraining loss while reducing attention connectivity to $\approx 0.3 \%$ of its edges. Unlike sparse-attention methods designed for computational efficiency, our approach leverages sparsity as a structural prior: it preserves capability while exposing a more organized and interpretable connectivity pattern. We find that this local sparsity cascades into global circuit simplification: task-specific circuits involve far fewer components (attention heads and MLPs) with up to 100x fewer edges connecting them. These results demonstrate that transformer attention can be made orders of magnitude sparser, suggesting that much of its computation is redundant and that sparsity may serve as a guiding principle for more structured and interpretable models.

large language model, machine learning, natural language, (18 more...)

arXiv:2512.05865

Country: Europe > Switzerland (0.28)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)
